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Abstract. We introduce a new method for building classification models when we have prior 
knowledge of how the classes can be arranged in a hierarchy, based on how easily they can be 
distinguished. The new method uses a Bayesian form of the multinomial logit (MNL, a.k.a. "soft- 
max") model, with a prior that introduces correlations between the parameters for classes that 
are nearby in the tree. We compare the performance on simulated data of the new method, the 
ordinary MNL model, and a model that uses the hierarchy in different way. We also test the new 
method on a document labelling problem, and find that it performs better than the other methods, 
particularly when the amount of training data is small. 

1 Introduction 

In this paper, we consider classification problems where classes have a hierarchical structure. The 
hierarchy reflects our prior opinion regarding similarity of classes. Two classes are considered 
similar if it is difficult to distinguish them from each other on the basis of the features available. 
The similarity of classes increases as we descend the hierarchy. 

One example of a hierarchical classification scheme is the annotations describing the biological 
function of genes. Functions are usually presented in a hierarchical form starting with very general 
classes and becoming more specific in lower levels of the hierarchy. There are many different gene 
annotation schemes. Figure ^ shows a small part of the scheme proposed by Rison 0. 

Another example of hierarchical classification is the document labelling problem. This involves 
classification of document regions to one of many predefined classes. It is possible to arrange these 
classes in a hierarchy such as that of Figure 0] below. We arranged these classes in a hierarchical 
form based on our guess as to how easily they can be distinguished. For instance, we considered 
"Figure Caption" and "Table Caption" similar to each other since they both usually contain small 
amounts of straight text and both are centre justified. 



1 



Gene 
Functions 




Small Cell Organism 

Energy Macromolecules Molecules Process Process 



Figure 1: A part of a gene annotation hierarchy. 



Bayesian models provide a framework that allows us to incorporate prior knowledge of this sort. 
The prior distribution in Bayesian models represents our belief (as well as our uncertainty) regarding 
likely values of parameters. For modelling hierarchical classes, we introduce a new method which 
is based on a Bayesian form of the multinomial logit (MNL) model. We use a prior that introduces 
correlations between the parameters of nearby classes. 

This paper is organized as folllows. In section 2, simple classification models and their extensions 
for analysing hierarchical classes are discussed. In section 3, using simulated data, we compare 
the performance of our model, the ordinary MNL model and an alternative model that uses the 
hierarchy in a different way. In section 4 we compare the same models on a document labelling 
problem. The last section summarizes our findings and presents some ideas for future research. 



2 Hierarchical classification 

Consider a classification problem in which we have observed data for n cases, (x^\y^), 
...,(x( n \ y( n )), where = x± \ ...,Xp^ is the vector of p covariates (features) for case i, and 
is the associated class. Our goal is to develop a model that assigns each data item to its correct 
class based on the observed covariates. The resulting model will be used to classify future cases for 
which the class membership is unknown but the covariates are available. For binary classification 
problems, a simple logistic model can be used: 

P(y = l\x,a,0) = — — ■ — -^r (1 

1 + exp(a + xp) 

Here, a is the intercept, (3 is a p x 1 vector of unknown parameters and x(3 is its inner product 
with the covariate vector. 

When there are three or more classes, we can use a generalization known as the multinomial 
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logit (MNL) model: 

where c is the number of classes. For each class, j, there is a vector of p unknown parameters 3 a. 
The entire set of regression coefficients 3 = 3\, ...,3 C can be presented asapxc matrix. This 
representation is redundant, since one of the /3j's can be set to zero without changing the set of 
relationships expressible with the model, but removing this redundancy would make it difficult to 
specify a prior that treats all classes symmetrically. For this model we can use the following priors: 

aj \T ~ iV(0,r 2 ) 
Pjilr ~ iV(0,r 2 ) 
Tq 2 ~ Gamma(ao,bo) 
Gamma(a, b) 



where j = 1, c and I = 1, ...,p. 

The MNL model treats classes as unrelated entities without any hierarchical structure. This 
is not always a realistic assumption. In many classification problems, like those discussed above, 
one can arrange classes in a hierarchical form analogous to the hierarchy of species arranged in 
genera, families, etc. If the classes have in fact the assumed structure, one would expect to obtain 
a higher performance by using this additional information. A special case is when the classes are 
ordered (e.g., education level). For these problems a more parsimonious model (e.g., cumulative 
logit model) with improved power can be used [2j. 

One approach for modelling hierarchical classes is to decompose the classification model into 
nested models (e^g., logistic or MNL). Nested MNL models are extensively discussed in econo- 
metrics (e.g., [3, |j|) in the context of estimating the probability of a person choosing a specific 
alternative (i.e., class) from a discrete set of options (e.g., different modes of transportation). 
These models, known as discrete choice models, aim at forecasting and explaining human decisions 
through optimizing an assumed utility (preference) function, which is different from our aim of 
maximizing classification accuracy. 

Goodman [jj showed that using hierarchical classes can significantly reduce the training time of 
maximum entropy-based language models and results in slightly lower perplexities. He illustrated 
his approach using a word labelling problem, and recommended that instead of predicting words 
directly, we can first predict the class the word belongs to, and then predict the word itself. This 
approach can be applied to the other learning techniques. 

Fox suggested using successive binary partitions (i.e., dichotomies) of classes and fitting a 
logistic model to each partition. In other words, we recursively split a set of classes into two 
mutually exclusive subsets where each subset contains similar classes. In Figure |2J these partitions 
are {12, 34}, {1, 2}, and {3, 4}. The resulting nested binary models are statistically independent. 
The likelihood can therefore be written as the product of the likelihoods for each of the binary 
models. For example, in Figure [21 we have 

P(y = l\x) = P(ye{l,2}\x)xP(ye{l}\ye{l,2},x) (3) 
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Restriction to binary models is unnecessary. At each level, classes can be divided into more than 
two subsets and MNL can be used instead of logistic regression. We refer to methods based on 
decomposing the tree structure into nested MNL models as treeMNL. Consider a parent node, m, 
with c m children nodes. The portion of the nested MNL model for this node has the following form: 

exp(a mfc + x(3 mk ) 



P(y — k\x, oc m , (3 n 



Ht=l eX P(«mfc + x Pmk) 



PmkllTm ~ N(0,T m ) 

Tq 2 ~ Gamma(ao,bo 
t~ 2 ~ Gamma(a m ,b m j 



where k = 1, c m and I = 1, ...,p. We calculate the probability of each end node, j, by multiplying 
the probabilities of all intermediate nodes leading to j. 

We introduce an alternative Bayesian framework for modelling hierarchical classes. Consider 
Figure |5J which shows a hierarchical classification problem with four classes. For each branch in 
the hierarchy, we define a different set of parameters. In Figured these parameters are denoted as 
0n and 0i2 for branches in the first level and 02i, 022> 023 and 024 for branches in the second 
level. We assign objects to one of the end nodes using a MNL model (equation |2J) whose regression 
coefficients for class j are represented by the sum of parameters on all the branches leading to that 
class. In Figure [21 these coefficients are (3\ = 0n + 02i, (3% = 0n + 022 , @3 = 012 + 031 and 
04 = 012 + 032 for classes 1,2,3 and 4 respectively. Sharing the common terms, 0n and 0i2, 
introduces prior correlation between the parameters of nearby classes in the hierarchy. 

In our model, henceforth called corMNL, 0's are vectors with the same size as /3's. We assume 
all the components of the 0's are independent, and have normal prior distributions with zero 
mean. The variance of the 0's is regarded as a hyperparameter, which controls the magnitude of 
coefficient parameters. When a part of the hierarchy is irrelevant, we hope the posterior distribution 
of its corresponding hyperparameter will be concentrated near zero, which results in the related 
parameters becoming close to zero. In figure |21 for example, when the hyperparameters in the first 
level become small (compare to the hyperparameters in the second level), the model reduces to 
simple MNL. In contrast, when these hyperparameters are relatively large, the model reinforces 
our assumption of hierarchical classes. For this hierarchy we use the following priors: 

aj \T ~ AT(0,r 2 ) j = l,...,4 

<\>mkl\T m ~ N(0,T^) 

Tq 2 ~ Gamma(ao,bo) 
t~ 2 ~ Gamma(a m ,b m ) 



Here, (j)mbi refers to the parameter related to covariate x\ and branch k of node m. As we can see, 
all the branches that share the same node are controlled by one hyperparameter. 

By introducing prior correlations between parameters for nearby classes, we can better handle 
situations in which these classes are hard to distinguish. If the hierarchy actually does provide 
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Class 1 Class 2 Class 3 Class 4 

Pi = 011 + 021 #2 = 011 + 022 A? = 012 + 023 Pi = 012 + 034 

Figure 2: A simple representation of our model. The coefficient parameters for each classes are 
presented as a sum of parameters at different level of hierarchy 

information about how distinguishable classes are, we expect that performance will be improved. 
This would be especially true when the training set is small and the prior has relatively more 
influence on the results. Using an inappropriate hierarchy will likely lead to worse performance than 
a standard MNL model, but since the hyperparameters can adapt to reduce the prior correlations 
to near zero, the penalty may not be large. 



3 Results for synthetic datasets 

So far, we have discussed three alternative models: MNL, treeMNL, and corMNL. We first com- 
pare these models using a synthetic four- way classification problem with two covariates. Data are 
generated from each of these models in turn, and then fit with each model in order to test the 
robustness of the models when applied to data generated from other models . 

All regression parameters are given normal priors with mean zero. For the MNL model, the 
standard deviation for all the intercepts, To, and the standard deviation for the rest of coefficients, 
r, have the following priors: 

t q 2 ~ Gamma(l, 10) (0.16,0.38,1.98) 
t -2 ~ Gamma(l,l) (0.52,1.20,6.27) 

We use the parameterization of the Gamma distribution in which Gamma(a, b) has density 
f(x\a,b) = [b a T(a)]~ 1 x a ~ 1 e~ x / b , for which the mean is ab and the standard deviation is a 1//2 6. 
We gave the 2.5, 50 and 97.5 percentiles of r in parenthesis. 

For treeMNL and corMNL models, we assume that classes are arranged in a hierarchical form 
as shown in Figure [2J This hierarchy implies that while it might be easy to distinguish between 
groups {1, 2} and {3, 4}, further separation of classes might not be as easy. As mentioned above, 
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the treeMNL model for this hierarchy is comprised of three nested logistic models. These models 
are: P(y G {1, 2}\a x , /3 1} x), P(y = l\a 2 ,f3 2 ,x,y G {1,2}) and P(y = 3|a 3 ,/3 3 ,x,y G {3,4}). For 
corMNL model, we use the priors discussed in section 2. The variance of regression parameters 
/3's in treeMNL and 0's in corMNL are regarded as hyperparameters. For these two models, one 
hyperparameter controls all the parameter emerging from the same node. These hyperparameters 
are given the following prior distributions: 

t 2 ~ Gamma{l, 10) (0.16,0.38,1.98) 

Tf 2 ~ Gamma(l,S) (0.23,0.54,2.82) 

t 2 - 2 ~ Gamma(l, 20) (0.05,0.12,0.63) 

T3- 2 ~ Gamma(l, 20) (0.05,0.12,0.63) 



Here, t±, T2 and T3 correspond to nodes 1, 2, and 3 respectively (FigureEJ). These parameters have 
a narrower prior compared to r in the MNL model. This is to account for the fact that the role of (3 
in the MNL model is played by more than one parameter in treeMNL and corMNL. Moreover, the 
regression parameters in the second level of hierarchy have a relatively smaller standard deviation 
r. As a result, these parameters tend to be smaller, making separation of class 1 from 2 and class 
3 from 4 more difficult. 

We do three tests, in which we assume that each of the MNL, treeMNL and corMNL is the 
correct model. This allow us to see how robust each model is when data actually come from a 
somewhat different model. For each test, we sample a set of parameters from the prior distribution 
of the corresponding model. Pairs of data items (x^',y^) are generated by first drawing 10000 
independent samples xf\x^ from the uniform(— 5, 5) distribution and then assigning each data 
item to one of the four possible classes. The assignment is either based on the multinomial model 
(for data generated from the MNL and corMNL) or based on successive logistic models (for data 
generated from the treeMNL). 

All three models are trained on the first 100 data items and tested on the remaining 9900 items. 
The regression coefficients were sampled from their posterior distribution using MCMC methods 
with single- variable slice sampling [jfl. At each iteration, we used the "stepping out" procedure 
to find the interval around the current point and used the "shrinkage" procedure for sampling 
from the interval. Since the hyperparameters were given conjugate priors, direct Gibbs sampling 
could be used for them. For all tests we drew 1000 samples from the posterior distributions. We 
discarded the initial 250 samples and used the rest for prediction. Performance is measured in 
terms of average log-probability and error rate on the test set. 

The above procedure was repeated 100 times. Each time, new regression parameters were sam- 
pled from priors and new pairs of data items were created based on the assumed models. The 
average results (over 100 iterations) are presented in Table ^ I n this table, each column corre- 
sponds to the model used for generating the data and each row corresponds to the model used for 
building the classifier. As we can see, the diagonal elements have the best performance in each 
column. That is, the model whose functional form matches the data generation mechanism per- 
forms significantly better than the other two models (all p-values based on average log-probability 
are less than 0.01 using a paired i-test with n = 100). Moreover, the results show that when 
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Table 1: Comparison of models on simulated data based on average log-probability using training 
sets of size N = 100 and N = 50. 
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Table 2: Comparison of models using a more complex hierarchy. 



the samples are generated according to the MNL model (i.e., classes are unrelated), corMNL has 
a significantly (p-value < 0.001) better performance compared to treeMNL. Reducing the size of 
training sets, N, to 50 increases the gap between the average performance of models but provides 
the same conclusions. The conclusions also remain the same when we use different priors and ranges 
of covariates. 

While statistically significant, the results presented in Tabled might not be significant for some 
practical purposes. This is mostly due to the simplicity of the hierarchical structure. Next, we 
repeated the above tests with a more complex hierarchy, as shown in Figure 01 For this prob- 
lem we used four covariates randomly generated from the uniform(0, 1) distribution. In all 
three models, we used the same prior as before for the intercepts. For the MNL model we set 
Gamma(l, 1). The hyperparameters of treeMNL and corMNL were given Gamma(l,5), 
Gamma(l, 20) and Gamma(l, 100) priors for the first, second and third level of the hierarchy 
respectively. 

Table [21 shows the average results over 100 datasets for each test. As we can see, the differences 
between models are more accentuated. Nevertheless, corMNL is still performing well even when 
the data are generated by other models, where corMNL is outperformed only by the true model. 
When data come from treeMNL, the results from corMNL are very close to those of the actual 
model (i.e., treeMNL). 
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Figure 3: A hypothetical hierarchy with a relatively more complex structure. 

4 Results on a document labelling problem 

We next test our approach using a dataset of the page images of 15 articles (472 pages) from the 
Journal of Machine Learning Research (JMLR) collected by Laven U Each page was segmented 
to several regions, and each region was manually classified to one of 24 possible classes. Figure 
presents these classes in a hierarchical form. The hierarchy is based on our belief regarding how 
difficult it is to separate classes from each other using the available features. 

The covariates are 59 different features such as the location of the region on the page and the 
density of the ink inside the region. We normalized all features so they have zero mean and standard 
deviation 1. 

Laven divided the dataset into a training set (including 10 articles with 3521 regions), and a 
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Figure 4: Hierarchical structure of document labels. 
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test set (including 5 articles with 2035 regions). The items from the same article are considered as 
independent even though they clearly are not. Although this may cause overfitting problems, we 
follow the same assumption for our initial test in order to make our results comparable to Laven's. 

We trained the three models (MNL, corMNL and treeMNL) on the training set and evaluated 
their performance on the test set. The coefficient parameters in the MNL models were given normal 
prior with mean zero. The variances of these parameters were regarded as hyperparameters. For this 
problem, since the number of covariates, p = 59, is relatively large, we use the Automatic Relevance 
Determination (ARD) method suggested by [9J. ARD employs a hierarchical prior to determine 
how relevant each covariate is in classification of objects. In this method, one hyperparameter, 07, 
is used to control the variance of all coefficients, /3ji (J = l,...,c), for covariate x\. If a covariate 
is irrelevant, its hyperparameter will tend to be small, forcing the coefficients for that covariate be 
near zero. As before, we also use one hyperparameter, r, to control all /3's in the MNL model. 
We set the standard deviation of (3ji equal to to[. Therefore, while o\ control the relevance of 
covariate X[ compared to other covariates, the scale parameter r, controls the overall usefulness of 
all covariates in separating classes. For the MNL model we used the following priors: 

aj\ <t ~ iV(0,r 2 ) j = l,..,24 

/%|t,<7j ~ N(0,r 2 af) Z = 1, ...,59 

t q 2 ~ Gamma(0.5,l) (0.63,2.09,46.31) 

t~ 2 ~ Gamma(0.5,20) (0.14,0.47,10.07) 

af 2 ~ Gamma(l, 10) (0.16,0.38,1.98) 

Similar priors are used for the parameters of treeMNL and corMNL. For these two models, we 
used one hyperparamter, af 2 ~ Gamma(l, 10) to control all parameters related to covariate xi. 
We also used one scale parameter T m 2 ~ Gcluiui(i(0.5, 100) for all parameters (/3's in treeMNL, <^>'s 
in corMNL) sharing the same node m. The prior for the intercepts was the same as in the MNL 
model. 

We used Hamiltonian dynamics ^(J f° r sampling from the posterior distribution of coefficients 
parameters. To reduce the random walk aspect of sampling procedure, we use a reasonably large 
number of leapfrog steps (L = 500). The stepsizes is set to 0.02 in order to maintain an acceptance 
rate of about 90%. In the MNL and corMNL models, the new updates are proposed for all regression 
parameters simultaneously. Nested MNL models in treeMNL are updated separately since they are 
regarded as independent models. The coefficient parameters within each nested model, however, 
are updated at the same time. We used single- variable slice sampling 0], as described above, for 
hyperparameters. The convergence of the Markov chain simulations were assessed by plotting the 
values of hyperparameters and average log-likelihood (on training cases). We ran each chain for 
2500 iterations, of which the first 500 were discarded. 

Table |3] compares the results from different models. As we can see, our MNL model has ex- 
actly the same performance as the likelihood-based model developed by Laven. The corMNL and 
treeMNL models have a slightly better error rate and a higher average log-probability compared to 
the MNL model. 

We speculated that the assumed structure would have a more profound effect when a smaller 
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9.4 
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-0.3998 
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treeMNL 


-0.3602 
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corMNL 


-0.3598 


9.2 



Table 3: Performance of models for the document labelling problem on the test set. 

sample is available. To test this idea, we reduced the size of the training set to 200, and used 10 
non-overlapping training sets in order to obtain a confidence interval for the performance of each 
model. To avoid problems from incorrectly assuming independence, we first combined the training 
and the test sets and then randomly sampled (without replacement) 10 training sets each with 200 
cases and used the remaining 3556 cases as the test set. Sampling was performed without regard to 
which document each region comes from, which artificially makes the independence assumption be 
true. We trained the models on each training set and measured the performance on the test set. We 
discarded a burn-in period of 500 initial updates and retained the next 3500 updates for prediction. 
The results are shown in Table |1J Since the test set is large, the observed variation in the results 
is dominated by the difference between training sets, and we can use paired i-tests (n = 10) for 
comparing the results. Based on the average log-probability results, the corMNL model outperforms 
both MNL (p-value = 0.002 ) and treeMNL (p-value = 0.004) models. The improvement on error 
rate is only slightly significant compared to MNL (p-value = 0.09) and treeMNL (p-value = 0.06). 
The differences in performance of MNL and treeMNL are not statistically significant. 
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MNL 


-0.5441 


15.4 


treeMNL 


-0.5381 


15.5 


corMNL 


-0.5160 


14.9 



Table 4: Performance of models for the document labelling problem on the test set using 10 non- 
overlapping training sets of size 200. 



5 Conclusions and future directions 

In this paper, we have introduced a new approach for modelling hierarchical classes. Our analysis 
shows that when the hierarchy actually does provide information regarding the similarity of classes, 
our approach outperforms simple MNL model and models based on decomposing the hierarchy into 
nested MNL models. Our method can be applied to many classification problems where there is 
prior knowledge regarding the structure of classes. One such problem, which we hope to investigate 
soon, is annotation of gene functions. 

So far, we have focused only on simple tree- like structures. There are other hierarchical structures 
that are more complex than a tree. For example, one of the most commonly used gene annotation 
schemes, known as Gene Ontology (GO), is implemented as a directed acyclic graph (DAG). In 
this structure a node can have more than one parent. Our method, as it is, cannot be applied to 
these problems, but it should be possible to extend the idea of summing coefficients along a path 
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to the class in order to allow for multiple paths. 

In our approach, we considered only one structure for each hierarchical classification problem. 
However, we might sometimes be able to think of more than one possible class hierarchy. We might 
then consider all possible hierarchies as equally likely and form an ensemble model based on this 
assumption (as done in but this might not be easy when the number of classes is large. Also, 
in most cases we have prior knowledge that leads us to prefer some structures over others a priori. 
It is possible to generalize our method to multiple hierarchies. As for the generalization to DAG's, 
it should be possible to sum coefficients along the multiple paths within different hierarchies. We 
can further use a set of hyperparameters to discover the relevance of each hierarchy. 

The results presented in this paper are for linear models. We expect that a similar approach can 
be used in non-linear models such as neural networks. 
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